Scale frequency to suppress RCU CPU stall warning #67
#define SEMU_BOOT_TARGET_TIME 10
#endif

bool boot_complete = false;
I suggest moving the `boot_complete` variable into `vm_t` for a more conceptually accurate design.
If we move `boot_complete` into `vm_t`, all existing functions for `semu_timer_t` would need an additional `vm_t` parameter. For example, `semu_timer_get` would change to:
semu_timer_get(vm_t *vm, semu_timer_t *timer)
This change would in turn require every caller of this function to pass in a `vm_t` parameter. For instance, since `semu_timer_get` is called within `aclint_mtimer_update_interrupts`, the API of `aclint_mtimer_update_interrupts` would also need to be updated to include `vm_t`.
As this pattern continues, the API changes would proliferate significantly. Perhaps we could introduce a static `bool` pointer that points to `boot_complete` and assign it during `semu_timer_init`. This way, we would only need to modify the parameters of `semu_timer_init`.
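A minimal sketch of the static-pointer idea, assuming simplified stand-ins for `vm_t` and `semu_timer_t` (the real structures have other fields) and a hypothetical `boot_complete_ref` name. The body of `semu_timer_get` is placeholder logic that only shows the flag being read without a `vm_t` argument:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the project's types. */
typedef struct {
    uint64_t freq;
} semu_timer_t;

typedef struct {
    bool boot_complete; /* flag owned by the VM, as the reviewer suggests */
    semu_timer_t timer;
} vm_t;

/* File-local pointer to the VM-owned flag; set once in semu_timer_init(). */
static const bool *boot_complete_ref = NULL;

/* Only this function gains a parameter. */
void semu_timer_init(semu_timer_t *timer,
                     uint64_t freq,
                     const bool *boot_complete)
{
    assert(timer && boot_complete);
    timer->freq = freq;
    boot_complete_ref = boot_complete;
}

/* Signature unchanged: callers do not need to pass a vm_t. */
uint64_t semu_timer_get(semu_timer_t *timer)
{
    assert(timer && boot_complete_ref);
    /* Placeholder: return a reduced value until boot has completed. */
    return *boot_complete_ref ? timer->freq : timer->freq / 10;
}
```

The trade-off is that a file-scope pointer ties the timer code to a single VM instance, which may or may not be acceptable for this project.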
utils.c
struct timespec start, end;
clock_gettime(CLOCKID, &start);

for (uint64_t loops = 0; loops < target_loop; loops++)
    clock_gettime(CLOCKID, &end);
I did not get the idea behind this code snippet. Why did you update the value of `end` several times?
This is meant to measure the execution time of `target_loop` calls to `clock_gettime`. In my understanding, the following code achieves the same purpose:
struct timespec start, end;
clock_gettime(CLOCKID, &start);
for (uint64_t loops = 0; loops < target_loop - 1; loops++) {
    struct timespec ts;
    clock_gettime(CLOCKID, &ts);
}
clock_gettime(CLOCKID, &end);
However, the variable `ts` is not used inside the for loop, so I simply replaced it with `end` as the argument to `clock_gettime`. This way, when the loop exits, `end` holds the time of the last `clock_gettime` call, which also measures the execution time of `target_loop` calls to `clock_gettime`.
Then, dividing the elapsed time by `target_loop` gives the execution time of a single `clock_gettime` call.
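The measurement described above can be sketched as a standalone function. This is only an illustration: `CLOCK_MONOTONIC` is assumed as a stand-in for the project's `CLOCKID` macro, and `measure_clock_gettime_ns` is a hypothetical name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumption: CLOCK_MONOTONIC stands in for the project's CLOCKID macro. */
#ifndef CLOCKID
#define CLOCKID CLOCK_MONOTONIC
#endif

/* Average cost, in nanoseconds, of one clock_gettime() call. */
double measure_clock_gettime_ns(uint64_t target_loop)
{
    assert(target_loop > 0);

    struct timespec start, end;
    clock_gettime(CLOCKID, &start);

    /* Each iteration overwrites 'end'; only the last reading is used. */
    for (uint64_t loops = 0; loops < target_loop; loops++)
        clock_gettime(CLOCKID, &end);

    int64_t elapsed_ns =
        (int64_t) (end.tv_sec - start.tv_sec) * 1000000000LL +
        (int64_t) (end.tv_nsec - start.tv_nsec);

    /* Total elapsed time divided by the number of timed calls. */
    return (double) elapsed_ns / (double) target_loop;
}
```

Note that the result still includes the loop's own overhead (counter increment and branch), which is usually small compared with the call itself.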
You should add comments explaining that this code measures the HRT syscall overhead.
By the way, you should wrap the POSIX, macOS, and standard-library time utilities.
I noticed that the boot time on macOS was far from the target time. I think this may be caused by the hypothesis of "
utils.c
/* Perform 'target_loop' times calling the host HRT. */
for (uint64_t loops = 0; loops < target_loop; loops++)
    (void) host_time_ns();
See Accurate Benchmarking to eliminate loop overhead.
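One common way to remove loop overhead is to time a baseline loop of the same shape that does not call the timer, then subtract it. The sketch below illustrates that general technique and is not necessarily what the linked article recommends. `host_time_ns` here is a hypothetical wrapper that assumes `CLOCK_MONOTONIC`, and `time_loop_ns` and `hrt_call_cost_ns` are illustrative names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical wrapper; assumes CLOCK_MONOTONIC is the host HRT. */
static uint64_t host_time_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
}

/* Time 'n' iterations; the body calls the HRT only when 'call_hrt' is set. */
static uint64_t time_loop_ns(uint64_t n, int call_hrt)
{
    /* The volatile sink keeps the compiler from deleting the baseline loop. */
    volatile uint64_t sink = 0;
    uint64_t start = host_time_ns();
    for (uint64_t i = 0; i < n; i++) {
        if (call_hrt)
            sink += host_time_ns();
        else
            sink += i;
    }
    return host_time_ns() - start;
}

/* Per-call HRT cost in nanoseconds with the baseline loop time subtracted. */
double hrt_call_cost_ns(uint64_t n)
{
    assert(n > 0);
    uint64_t with_call = time_loop_ns(n, 1);
    uint64_t baseline = time_loop_ns(n, 0);
    /* Clamp at zero in case timing noise makes the baseline larger. */
    uint64_t net = with_call > baseline ? with_call - baseline : 0;
    return (double) net / (double) n;
}
```

Both loops pay for the same counter, branch, and store, so the difference approximates the cost of the call alone.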
Test under macOS + Apple M1:
Since the emulator currently operates using sequential emulation, the execution time for the boot process is relatively long, which can result in the generation of RCU CPU stall warnings.

To address this issue, there are several potential solutions:
1. Scale the frequency to slow down emulator time during the boot process, thereby eliminating RCU CPU stall warnings.
2. During the boot process, avoid using 'clock_gettime' to update ticks and instead manage the tick increment relationship manually.
3. Implement multi-threaded emulation to accelerate the emulator's execution speed.

For the third point, while implementing multi-threaded emulation can significantly accelerate the emulator's execution speed, it cannot guarantee that this issue will not reappear as the number of cores increases in the future. Therefore, a better approach is to use methods 1 and 2 to allow the emulator to set an expected time for completing the boot process.

The advantages and disadvantages of the scale method are as follows:
Advantages:
- Simple implementation
- Effectively sets the expected boot process completion time
- Results have strong interpretability
- Emulator time can be easily mapped back to real time
Disadvantages:
- Slower execution speed

The advantages and disadvantages of the increment ticks method are as follows:
Advantages:
- Faster execution speed
- Effectively sets the expected boot process completion time
Disadvantages:
- More complex implementation
- Some results are difficult to interpret
- Emulator time is difficult to map back to real time

Based on practical tests, the second method provides limited acceleration but introduces some significant drawbacks, such as difficulty in interpreting results and the complexity of managing the increment relationship. Therefore, this commit opts for the scale frequency method to address this issue.

This commit divides time into emulator time and real time. During the boot process, the timer uses scale frequency to slow down the growth of emulator time, eliminating RCU CPU stall warnings. After the boot process is complete, the growth of emulator time aligns with real time.

To configure the scale frequency parameter, three pieces of information are required:
1. The expected completion time of the boot process
2. A reference point for estimating the boot process completion time
3. The relationship between the reference point and the number of SMPs

According to the Linux kernel documentation:
https://docs.kernel.org/RCU/stallwarn.html#config-rcu-cpu-stall-timeout
The grace period for RCU CPU stalls is typically set to 21 seconds. By dividing this value by two as the expected completion time, we can provide a sufficient buffer to reduce the impact of errors and avoid RCU CPU stall warnings.

Using 'gprof' for basic statistical analysis, it was found that 'semu_timer_clocksource' accounts for approximately 10% of the boot process execution time. Since the logic within 'semu_timer_clocksource' is relatively simple, its execution time can be assumed to be nearly equal to 'clock_gettime'.

Furthermore, by adding a counter to 'semu_timer_clocksource', it was observed that each time the number of SMPs increases by 1, the execution count of 'semu_timer_clocksource' increases by approximately '2 * 10^8'.

With this information, we can estimate the boot process completion time as 'sec_per_call * SMPs * 2 * 10^8 * (100% / 10%)' seconds, and thereby calculate the scale frequency parameter. For instance, if the estimated time is 200 seconds and the target time is 10 seconds, the scaling factor would be '10 / 200'.
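The estimate in the commit message can be worked through in a few lines of C. This is a sketch only: `SEMU_BOOT_TARGET_TIME` and the constants come from the text above, while the function names and the `CALLS_PER_SMP` and `CLOCKSOURCE_FRACTION` macros are hypothetical.

```c
#include <assert.h>

#define SEMU_BOOT_TARGET_TIME 10  /* seconds, as defined in the patch */
#define CALLS_PER_SMP 2e8         /* observed extra calls per additional SMP */
#define CLOCKSOURCE_FRACTION 0.10 /* share of boot time reported by gprof */

/* Estimated boot time in seconds:
 * sec_per_call * SMPs * 2 * 10^8 * (100% / 10%). */
double estimate_boot_seconds(double sec_per_call, unsigned n_smp)
{
    assert(sec_per_call > 0.0 && n_smp > 0);
    return sec_per_call * n_smp * CALLS_PER_SMP / CLOCKSOURCE_FRACTION;
}

/* Scaling factor: target time divided by estimated time. */
double boot_scale_factor(double sec_per_call, unsigned n_smp)
{
    return SEMU_BOOT_TARGET_TIME / estimate_boot_seconds(sec_per_call, n_smp);
}
```

For example, with an assumed 25 ns per call and 4 SMPs, the estimate is 25e-9 * 4 * 2e8 * 10 = 200 seconds, giving the 10 / 200 factor used as the example above.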
Test on Ubuntu Linux + eMAG (32-core Armv8-A):
Based on the output, I think it is clear that the assumption of "
@@ -56,6 +56,8 @@ OBJS := \
    aclint.o \
    $(OBJS_EXTRA)

LDFLAGS := -pg
The `-pg` instrumentation option might be too heavy. You should consider more lightweight approaches.
Since the emulator currently operates using sequential emulation, the execution time for the boot process is relatively long, which can result in the generation of RCU CPU stall warnings.
To address this issue, there are several potential solutions:
1. Scale the frequency to slow down emulator time during the boot process, thereby eliminating RCU CPU stall warnings.
2. During the boot process, avoid using `clock_gettime` to update ticks and instead manage the tick increment relationship manually.
3. Implement multi-threaded emulation to accelerate the emulator's execution speed.

For the third point, while implementing multi-threaded emulation can significantly accelerate the emulator's execution speed, it cannot guarantee that this issue will not reappear as the number of cores increases in the future. Therefore, a better approach is to use methods 1 and 2 to allow the emulator to set an expected time for completing the boot process.
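The advantages and disadvantages of the scale method are as follows:
Advantages:
- Simple implementation
- Effectively sets the expected boot process completion time
- Results have strong interpretability
- Emulator time can be easily mapped back to real time
Disadvantages:
- Slower execution speed

The advantages and disadvantages of the increment ticks method are as follows:
Advantages:
- Faster execution speed
- Effectively sets the expected boot process completion time
Disadvantages:
- More complex implementation
- Some results are difficult to interpret
- Emulator time is difficult to map back to real time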
Based on practical tests, the second method provides limited acceleration but introduces some significant drawbacks, such as difficulty in interpreting results and the complexity of managing the increment relationship. Therefore, this commit opts for the scale frequency method to address this issue.
This commit divides time into emulator time and real time. During the boot process, the timer uses scale frequency to slow down the growth of emulator time, eliminating RCU CPU stall warnings. After the boot process is complete, the growth of emulator time aligns with real time.
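The split between emulator time and real time can be illustrated with a small sketch. The structure and function names are hypothetical and do not reflect the project's actual `semu_timer_t` layout. Keeping emulator time continuous at the switch by recording an offset is an assumption of this sketch, not something the description states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state, not the project's real timer structure. */
typedef struct {
    double scale;       /* boot-time scaling factor, e.g. 10.0 / 200.0 */
    bool boot_complete; /* set once the guest has finished booting */
    int64_t offset_ns;  /* real time minus emulator time at the switch */
} scaled_clock_t;

/* Map a real-time reading (ns since start) to emulator time (ns). */
uint64_t emulator_time_ns(const scaled_clock_t *c, uint64_t real_ns)
{
    assert(c);
    if (!c->boot_complete)
        /* During boot, emulator time grows more slowly than real time. */
        return (uint64_t) ((double) real_ns * c->scale);
    /* After boot, emulator time advances at the same rate as real time. */
    return (uint64_t) ((int64_t) real_ns - c->offset_ns);
}

/* Record the gap at the end of boot so emulator time does not jump. */
void mark_boot_complete(scaled_clock_t *c, uint64_t real_ns)
{
    assert(c && !c->boot_complete);
    c->offset_ns = (int64_t) real_ns - (int64_t) emulator_time_ns(c, real_ns);
    c->boot_complete = true;
}
```

With a scale of 0.25, for instance, 40 real seconds of boot appear to the guest as 10 seconds, and every real second after `mark_boot_complete` adds exactly one emulated second.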
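To configure the scale frequency parameter, three pieces of information are required:
1. The expected completion time of the boot process
2. A reference point for estimating the boot process completion time
3. The relationship between the reference point and the number of SMPs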
According to the Linux kernel documentation:
https://docs.kernel.org/RCU/stallwarn.html#config-rcu-cpu-stall-timeout
The grace period for RCU CPU stalls is typically set to 21 seconds. By dividing this value by two as the expected completion time, we can provide a sufficient buffer to reduce the impact of errors and avoid RCU CPU stall warnings.
Using gprof for basic statistical analysis, it was found that `semu_timer_clocksource` accounts for approximately 10% of the boot process execution time. Since the logic within `semu_timer_clocksource` is relatively simple, its execution time can be assumed to be nearly equal to that of `clock_gettime`.
Furthermore, by adding a counter to `semu_timer_clocksource`, it was observed that each time the number of SMPs increases by 1, the execution count of `semu_timer_clocksource` increases by approximately $2 \times 10^8$ (see the table below).
With this information, we can estimate the boot process completion time as $\text{sec\_per\_call} \times \text{SMPs} \times 2 \times 10^8 \times \frac{100\%}{10\%}$ seconds, and thereby calculate the scale frequency parameter. For instance, if the estimated time is 200 seconds and the target time is 10 seconds, the scaling factor would be `10 / 200`.
To calculate the proportion of `semu_timer_clocksource` using gprof, simply add the `-pg` option during compilation. Then, terminate the emulator immediately after the first switch to U mode (when the boot process is complete). This approach allows for a rough estimation of the proportion of time spent in `semu_timer_clocksource`.
Below is the gprof testing output:
CPU: 13th Gen Intel(R) Core(TM) i7-13700
CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
By adding a counter to `semu_timer_clocksource`, it also becomes possible to calculate the relationship between SMPs and the number of times `semu_timer_clocksource` is called.
Below is the testing output (13th Gen Intel(R) Core(TM) i7-13700):
semu_timer_total_ticks